Add timedelta, timedelta64 and datetime64 plus respective conversions #509

Open
wants to merge 25 commits into
base: main

hhaensel
Contributor

@hhaensel hhaensel commented Jun 16, 2024

This PR replaces #334 and takes into account the major refactoring of PythonCall.
Particularly, it fixes #293.

What's new?

Python Constructors

julia> pytimedelta(hour = 1, minute = 2)
Python: datetime.timedelta(seconds=3720)

julia> pytimedelta64(hour = 1, minute = 2)
Python: numpy.timedelta64(62,'m')

julia> pytimedelta64(year = 2, month = 3)
Python: numpy.timedelta64(27,'M')

julia> pydatetime64(year = 2024, month = 3)
Python: numpy.datetime64('2024-03-01T00:00:00')

Conversion to Julia types

julia> x = pytimedelta64(year = 11)
Python: numpy.timedelta64(11,'Y')

julia> pyconvert(Any, x) |> x -> (x, typeof(x))
(11 years, Dates.CompoundPeriod)

julia> pyconvert(Period, x) |> x -> (x, typeof(x))
(Year(11), Year)

DataFrame handling

I've set the priority of datetime, timedelta, datetime64 and timedelta64 to ARRAY, which allows for automatic Table conversion - I hope that's the intended way to do it.

julia> jdf = DataFrame(x = [now() + Second(rand(1:1000)) for _ in 1:100], y = [Second(n) for n in 1:100])
100×2 DataFrame
 Row │ x                        y
     │ DateTime                 Second
─────┼──────────────────────────────────────
   1 │ 2024-06-17T00:31:31.236  1 second
   2 │ 2024-06-17T00:30:30.236  2 seconds
   3 │ 2024-06-17T00:41:22.236  3 seconds
  ⋮  │            ⋮                  ⋮
  98 │ 2024-06-17T00:36:05.236  98 seconds
  99 │ 2024-06-17T00:38:38.236  99 seconds
 100 │ 2024-06-17T00:28:21.236  100 seconds
                             94 rows omitted

julia> pdf = pytable(jdf)
Python:
                         x               y
0  2024-06-17 00:31:31.236 0 days 00:00:01
1  2024-06-17 00:30:30.236 0 days 00:00:02
2  2024-06-17 00:41:22.236 0 days 00:00:03
3  2024-06-17 00:33:52.236 0 days 00:00:04
           ... 4 more lines ...
97 2024-06-17 00:36:05.236 0 days 00:01:38
98 2024-06-17 00:38:38.236 0 days 00:01:39
99 2024-06-17 00:28:21.236 0 days 00:01:40

[100 rows x 2 columns]

julia> DataFrame(PyTable(pdf))
100×2 DataFrame
 Row │ x                        y
     │ DateTime                 Compound
─────┼───────────────────────────────────────────────
   1 │ 2024-06-17T00:31:31.236  1 second
   2 │ 2024-06-17T00:30:30.236  2 seconds
   3 │ 2024-06-17T00:41:22.236  3 seconds
  ⋮  │            ⋮                  ⋮
  98 │ 2024-06-17T00:36:05.236  1 minute, 38 seconds
  99 │ 2024-06-17T00:38:38.236  1 minute, 39 seconds
 100 │ 2024-06-17T00:28:21.236  1 minute, 40 seconds
                                      94 rows omitted

Default Conversion

I chose Dates.CompoundPeriod as the result type of the default conversion from timedelta64, since both types support year, month and smaller period units. This is debatable; we could also change it to Period, in which case the resulting type would depend on the input.

julia> pyconvert(Any, x) |> x -> (x, typeof(x))
(11 years, Dates.CompoundPeriod)

julia> pyconvert(Period, x) |> x -> (x, typeof(x))
(Year(11), Year)

Neither Python nor Julia converts between Year/Month and the other period types, so this choice carries no risk of producing ill-defined intervals.
The difference is that Julia allows addition/subtraction of mixed types while Python/Numpy throws an error.
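
A minimal sketch of the Julia side of this statement, using only the Dates standard library (no PythonCall or NumPy involved). The variable name `mixed` is illustrative and not taken from the PR:

```julia
using Dates

# Adding periods of different types does not throw in Julia;
# the result is a Dates.CompoundPeriod that keeps each unit separate.
mixed = Year(2) + Month(3) + Day(5)
@assert mixed isa Dates.CompoundPeriod

# Year and Month are not folded into fixed-length units such as Day,
# so the compound period still holds three separate components.
@assert length(mixed.periods) == 3

# The components are applied to a date one after another.
println(Date(2024, 1, 1) + mixed)  # 2026-04-06
```

On the NumPy side, per the description above, mixing year/month with the smaller units raises an error instead.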

The difference to the previous PR is that all conversions rely on either builtin or numpy functions and do not use pandas.

The ordering of arguments for pytimedelta() was chosen to be identical to the Python version, while the ordering for pytimedelta64() is strictly descending, except for week, which comes last.

EDIT: add comments in what's new code, added conversions
EDIT2: removed comment about datetime_data, I had misunderstood the meaning and have updated the code

@hhaensel hhaensel changed the title Add timedelta, timedelta64 and datetime64 plus respective conversions Add timedelta64 and datetime64 plus respective conversions Jun 17, 2024
@hhaensel hhaensel changed the title Add timedelta64 and datetime64 plus respective conversions Add timedelta, timedelta64 and datetime64 plus respective conversions Jun 17, 2024
@hhaensel
Contributor Author

@cjdoris what do you think about this PR?

@hhaensel
Contributor Author

@cjdoris still hoping to get this integrated ...
Is there anything pending or unclear that I could help with?

@MilesCranmer
Contributor

MilesCranmer commented Aug 23, 2024

Just chiming in with some comments:

1: Could you add some tests for this? This adds a lot of new features, so I think it should have tests that cover everything. Ideally the test coverage of the diff should be 100%. The tests should also cover your intended use case with DataFrame. They should also cover this behavior:

The difference is that Julia allows addition/subtraction of mixed types while Python/Numpy throws an error.

as this seems subtle.

2: Is the 3-arg version of the @py needed? I don't see it being used anywhere. I think it should go into a different PR.

src/Convert/numpy.jl
year::Int=_year, month::Int=_month, day::Int=_day, hour::Int=_hour, minute::Int=_minute, second::Int=_second,
millisecond::Int=_millisecond, microsecond::Int=_microsecond, nanosecond::Int=_nanosecond
)
pyimport("numpy").datetime64("$(DateTime(year, month, day, hour, minute, second))") + pytimedelta64(;millisecond, microsecond, nanosecond)
Contributor

Is pyimport("numpy") the correct API call, or is that just to be used in user packages?

Contributor Author

I saw similar calls at different places in the package, so I took this approach. But I also wouldn't know how to code a timedelta64 without calling pyimport.
Please let me know if there's a better solution.

Comment on lines 82 to 83
T = types[findfirst(==(unit), units)]
pyconvert_return(CompoundPeriod(T(value * count)) |> canonicalize)
Contributor

The proper way to do this would be to use Base.Cartesian.@nif. That way you could write this code to avoid dynamic dispatch on types (which will be very slow).

Comment on lines 71 to 74
units = ("Y", "M", "W", "D", "h", "m", "s", "ms", "us", "ns")
types = (Year, Month, Week, Day, Hour, Minute, Second, Millisecond, Microsecond, Nanosecond)
T = types[findfirst(==(unit), units)]
pyconvert_return(DateTime(_base_datetime) + T(value * count))
Contributor

Similar to other comment – you should write this using Base.Cartesian.@nif over the types tuple to avoid dynamic dispatch.

Contributor Author

Again, I tested this with plain Julia function calls and found that the Julia part takes around 25 ns, whereas the Python call takes around 1.5 µs.
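
For reference, one possible reading of the `Base.Cartesian.@nif` suggestion, next to the tuple lookup from the snippet under review. This is only a sketch: `UNITS`, `TYPES`, `period_lookup` and `period_nif` are hypothetical names rather than code from the PR, and the Python round trip (which the timings above suggest dominates) is left out entirely:

```julia
using Dates

const UNITS = ("Y", "M", "W", "D", "h", "m", "s", "ms", "us", "ns")
const TYPES = (Year, Month, Week, Day, Hour, Minute, Second, Millisecond, Microsecond, Nanosecond)

# Hypothetical helper: index-based lookup, as in the reviewed snippet,
# with an explicit error for unknown units.
function period_lookup(unit::AbstractString, value::Integer)
    i = findfirst(==(unit), UNITS)
    i === nothing && throw(ArgumentError("unsupported unit: $unit"))
    return TYPES[i](value)
end

# Hypothetical helper: the same mapping unrolled at macro-expansion time into
# an if/elseif chain. With N = 11, indices 1 to 10 are tested as conditions
# and the last expression becomes the final `else` branch.
function period_nif(unit::AbstractString, value::Integer)
    Base.Cartesian.@nif 11 i->(unit == UNITS[i]) i->(TYPES[i](value)) i->(throw(ArgumentError("unsupported unit: $unit")))
end

@assert period_lookup("m", 62) == Minute(62)
@assert period_nif("m", 62) == Minute(62)
@assert period_nif("Y", 11) == Year(11)
```

Either version returns a different concrete period type depending on `unit`, so the caller still sees a union of types. Whether the unrolled form is measurably faster would need benchmarking.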

_year::Int=0, _month::Int=0, _day::Int=0, _hour::Int=0, _minute::Int=0, _second::Int=0, _millisecond::Int=0, _microsecond::Int=0, _nanosecond::Int=0, _week::Int=0;
year::Int=_year, month::Int=_month, day::Int=_day, hour::Int=_hour, minute::Int=_minute, second::Int=_second, microsecond::Int=_microsecond, millisecond::Int=_millisecond, nanosecond::Int=_nanosecond, week::Int=_week)
pytimedelta64(sum((
Year(year), Month(month), # you cannot mix year or month with any of the below units in python, the error will be thrown by `pytimedelta64(::CompoundPeriod)`
Contributor

@MilesCranmer MilesCranmer Aug 23, 2024

This comment

you cannot mix year or month with any of the below units in python, the error will be thrown by pytimedelta64(::CompoundPeriod)

Should be presented to the user as a descriptive error message rather than a comment in the function

Contributor Author

Maybe the comment isn't clear enough.
Python throws a clear, descriptive error in case of wrong usage, so there is no need for us to do so. Agree?


function pyconvert_rule_timedelta64(::Type{CompoundPeriod}, x::Py)
unit, count = pyconvert(Tuple, pyimport("numpy").datetime_data(x))
value = reinterpret(Int64, pyconvert(Vector, x))[1]
Contributor

Is reinterpret safe here? Is there a better alternative to use?

Contributor Author

I thought pyconvert creates a new Julia Vector that is not mapped onto Python data. If that were not the case, we'd need to wrap the vector in a copy().

Comment on lines 120 to 122
for T in (CompoundPeriod, Year, Month, Day, Hour, Minute, Second, Millisecond, Microsecond, Nanosecond, Week)
pyconvert_add_rule("numpy:timedelta64", T, pyconvert_rule_timedelta64, priority)
end
Contributor

@MilesCranmer MilesCranmer Aug 23, 2024

Since Julia is unlikely to unroll this loop, you should use Base.Cartesian.@nexprs here to avoid dynamic dispatch.

Contributor Author

Tried my best, but I'm not sure how to test whether this will speed up things

Comment on lines 40 to 41
args = T .== (Day, Second, Millisecond, Microsecond, Minute, Hour, Week)
pydatetime64(x.value .* args...)
Contributor

Probably better to rewrite this with Base.Cartesian.@nif rather than doing a masked sum, since you know there will be only 1 element in the sum.

Contributor Author

This sum is damned fast (16 ns), and I couldn't beat it with a different version.
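
A small self-contained sketch of the masking trick discussed here, using only the Dates standard library. The helper name `onehot_args` is hypothetical, the tuple of types is copied from the snippet under review, and the final call into Python is omitted:

```julia
using Dates

# Hypothetical helper: compare the period's type against a fixed tuple of
# types. Types broadcast as scalars, so `T .== (...)` yields a tuple of Bools
# with exactly one `true`; multiplying by the value gives a one-hot tuple.
function onehot_args(p::T) where {T<:Period}
    mask = T .== (Day, Second, Millisecond, Microsecond, Minute, Hour, Week)
    return Dates.value(p) .* mask
end

println(onehot_args(Minute(7)))  # (0, 0, 0, 0, 7, 0, 0)
```

Because `T` is known at compile time, the mask is a tuple of constants, which may explain why this version is hard to beat.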

@hhaensel
Contributor Author

Thanks for all the comments, I already learned some new things about Julia, which is always nice.

I will go through them over the next few days and, once we've agreed on the solutions, write the test functions.

Just want to re-raise my question from the previous PR.

Do you think we should stick with CompoundPeriod as the default conversion type?
Alternatively, we could go for Period instead.

@hhaensel
Contributor Author

@cjdoris @MilesCranmer I'd really love to see these changes merged. Please let me know if there are any remaining things to be done from your end?

@MilesCranmer
Contributor

Could you please add some tests and docs?

@MilesCranmer
Contributor

Can you rebase on #583 to fix the tests?

@hhaensel
Contributor Author

I did rebase, the errors come from the new julia version.
I already submitted one PR on pushfirst!, there's another issue with the doc function that I cannot reproduce on my laptop.

@MilesCranmer
Contributor

I see, thanks. In the meantime can you add tests and docs for the code changes? Ideally we should have 100% test coverage.

@hhaensel
Contributor Author

I did add tests for pytimedelta.
Tests of pytimedelta64 depend on numpy. But as far as I could see, you did not define a CondaPkg environment so that numpy or pandas could be tested. Moreover, I saw that none of the standard python API is documented by docstrings, e.g. pytime().
Therefore, I am a bit lost where to begin adding docstrings for my extensions.
I tested using

using CondaPkg
CondaPkg.add("pandas")

which I think is a reasonable approach.
If you are of the same opinion, I could write a PR that fixes the two locations, where :pandas is missing in the current tests and then continue completing this PR so that finally it doesn't error anymore.

@MilesCranmer
Contributor

Were the test files not committed? I don’t see any added tests

@hhaensel
Contributor Author

I submitted #588 and #589. I propose to merge these first before continuing here with docstrings and tests.

@hhaensel
Contributor Author

When writing the tests for pytimedelta and pytimedelta64 I realised that my choice for keyword argument naming might be confusing for users.
Currently it is

pytimedelta(second = 1)

but

pyimport("datetime").timedelta(seconds = 1)

My choice was inspired by the Dates API (e.g. Second(1)).
I tend towards appending an 's' to the keywords; what do you think?

hhaensel added 2 commits January 20, 2025 16:23
append 's' to keywords in pytimedelta and pytimedelta64
@hhaensel
Contributor Author

hhaensel commented Jan 20, 2025

Changed the keywords. Tests should pass as soon as the other two PRs are merged.
One remaining error is Aqua complaining about REPL being a stale dependency.

@hhaensel
Contributor Author

Added functionality so that the unit of pytimedelta64() is determined by the smallest unit specified, also when the value is 0.
Moreover, the user can specify whether the time difference should be canonicalized; the default is false.
So

julia> pytimedelta64(Year(0))
Python: np.timedelta64(0,'Y')

julia> pytimedelta64(Week(0))
Python: np.timedelta64(0,'W')

julia> pytimedelta64(Millisecond(0))
Python: np.timedelta64(0,'ms')

and

julia> pyconvert(Dates.CompoundPeriod, pytimedelta64(Second(60)))
60 seconds

julia> pyconvert(Dates.CompoundPeriod, pytimedelta64(Second(60), canonicalize = true))
1 minute

and

julia> pyconvert(Dates.CompoundPeriod, pytimedelta64(Second(60)))
60 seconds

julia> PythonCall.Convert.CANONICALIZE_TIMEDELTA64[] = true
true

julia> pyconvert(Dates.CompoundPeriod, pytimedelta64(Second(60)))
1 minute
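
For context, the canonicalization step can be illustrated with the Dates standard library alone. This is a sketch under the assumption that the `canonicalize = true` keyword and the `CANONICALIZE_TIMEDELTA64` switch above end up applying `Dates.canonicalize` to the converted period; it does not call PythonCall:

```julia
using Dates

# A compound period holding a single, non-normalized component.
raw = Dates.CompoundPeriod(Second(60))

# canonicalize redistributes the value over the coarsest matching units.
canon = Dates.canonicalize(raw)

println(raw)    # 60 seconds
println(canon)  # 1 minute
println(Dates.canonicalize(Dates.CompoundPeriod(Second(98))))  # 1 minute, 38 seconds
```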

@hhaensel
Contributor Author

I hope you find these changes consistent.

@hhaensel
Contributor Author

There's an error for CondaPkg on Windows, but in principle, the tests pass.
Will try to find a better method for CondaPkg environment.

@hhaensel
Contributor Author

I cherry-picked #589 in order to proceed with testing.
Now everything seems fixed for julia 1.11

  • added tests for pytimedelta, pytimedelta64 and pydatetime64
  • several fixes that came up during testing
  • CondaPkg environment seems reasonably solved by moving to a test/Project.toml with its own CondaPkg.toml

The docs are probably the only remaining topic.
Should I simply add docstrings? What is your recommendation?

Successfully merging this pull request may close these issues.

DataFrame(::PyPandasDataFrame) converts date & datetime to bytes
2 participants